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Abstract 

Association mapping can quickly and efficiently dissect complex agronomic traits. Rapeseed is one of the 
most economically important polyploid oil crops, although its genome sequence is not yet published. In this 
study, a recently developed 60K Brassica Infinium® SNP array was used to analyse an association panel with 
472 accessions. The single-nucleotide polymorphisms (SNPs) of the array were in silico mapped using 'pseu- 
domolecules' representative of the genome of rapeseed to establish their hypothetical order and to perform 
association mappingof seed weight and seed quality. As a result, two significant associations on A8 and C3 of 
Brassica napus were detected for erucic acid content, and the peak SNPs were found to be only 233 and 
1 28 kb away from the key genes BnaA.FAEI and BnaC.FAEI. BnaAJAEI was also identified to be significantly 
associated with the oil content. Orthologues of Arabidopsis thaliana HAG 1 were identified close to four clus- 
ters of SNPs associated with glucosinolate content on A9, C2, C7 and C9. For seed weight, we detected two 
association signals on A7 and A9, which were consistent with previous studies of quantitative trait loci 
mapping. The results indicate that our association mapping approach is suitable for fine mapping of the 
complex traits in rapeseed. 
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1 . Introduction 

Rapeseed (Brassica napus L; AACC, 2n = 38) is the 
one of the most important oil crops worldwide. 
Brassica napus is a recent allopolyploid species derived 
from natural interspecies hybridization between two 
phylogenetically close species, B. rapa (AA, 2n = 20) 
and B. oleracea (CC, 2n = 1 8) <1 0 000 years ago.'-^ 
Rapeseed has a short domestication history of only 
~400-500 years.^"^ During the breeding history of 
rapeseed, the most outstanding event is the introduction 



of two traits, zero seed erucic acid and low seed glu- 
cosinolate content (so called double-low, canola 
quality 00). However, many traits such as oil content, 
seed yield, disease resistance, etc. urgently need to be 
further improved in the modern cultivars. Molecular 
design breeding is currently one of the most available 
breeding methods. Screening-specific materials with 
some desired traits in a comparatively genetically 
diverse source of germplasm including old genotypes 
with high levels of erucic acid and seed glucosino- 
lates and discovering the advantageous allelic variants 



©The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute. 

This is an Open Access article distributed under the terms of the Creative Com mens Attribution Non-Commercial License (http://creativecommons.org/ 
licenses/by-nc/4.0/),which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. 
For commercial re-use, please contactjournals.permissions@oup.com 



356 



Association Study Using SNP Array in Rapeseed 



[Vol. 2 1 , 



by genetic analysis may advance molecular design 
breeding. 

Genetic mapping of quantitative trait loci (QTL) in 
rapeseed is well established and has been employed 
to localize QTLs for quantitative traits, such as oil 
content,^ glucosinolate content/ fatty acid compos- 
ition of the seed oil/ flowering time/ yield compo- 
nents/ etc. In all these studies, although QTL 
mapping is very successful in detecting QTL, the 
genetic variation in the mapping population is 
restricted to only the two parents, and the markers for 
detected QTL are not necessarily transferable to other 
materials. Furthermore, the ability of fine mapping a 
QTL in these studies is limited by the frequency of poly- 
morphic loci between the two parents and requires a 
population consisting of several thousands of indivi- 
duals, making it laborious and time-consuming to 
perform. There is, so far, no report on map-based 
cloning of a causal gene in a QTL of rapeseed. 

Association mapping, also called linkage disequilib- 
rium (LD) mapping, which directly studies statistical 
associations between genetic markers and phenotypes 
in natural populations, is an alternative to QTL map- 
ping.^ An association mapping study utilizes the 
higher number of historical recombination events 
that have occurred throughout the entire evolutionary 
history of the mapping population, allowing fine-scale 
QTL mapping.^ '^'^ ^ With the rapid developments in gen- 
omics and dramatically decreasing cost of genotyping 
technology, association mapping has rapidly became a 
promising approach for the genetic dissection of 
complex traits. For a species with available genome 
sequences such as Arabidopsis thaliana, rice, maize, 
etc., genome-wide association studies (GWASs) have 
contributed to revealing rich genetic architectures of 
complex traits.' 

Since rapeseed contains two homologous but di- 
vergent subgenomes A and C, which were shaped 
by whole-genome triplication followed by extensive 
diploidization,^^'^^ its genome structure is complex, 
which has hindered genomic research and high 
throughput discovery of high-quality molecular 
markers. The population structure, LD and association 
mappingin rapeseed were studied usingamplified frag- 
ment length polymorphism and simple sequence repeat 
markers.^"^"^^ However, due to the limited number of 
DNA markers and low-efficient genotyping technologies, 
a small number of lines or markers were used in these 
studies. Recent advances in sequencing and computation- 
al technology have enabled the discovery and efficient 
assay of large numbers of single-nucleotide polymorph- 
ism (SNP) markers in rapeseed. ^°'^' Delourme et al.^^ 
analysed the genetic diversity and LDofa rapeseed col- 
lection of 313 inbred lines from different geographical 
origins using >4300 SNPs that were localized on an 
integrated map of rapeseed. The genomic research of 



Brassica genera has developed rapidly in recent years. 
The Brassica A genome sequence from B. rapa has 
been published, and Brassica C genome sequencing 
from B. oleracea has also been completed and will be 
published soon (http://www.ocri-genomics.org/bolbase/). 
Qwingto the complicated genome of 6. napus, genome 
sequencing has notyet been completed. Based on a high- 
density (~21 K) SNP linkage map of rapeseed,^' Harper 
etal.^'* refined the order and orientation of the A and C 
genome sequence scaffolds, constructed pseudomole- 
cules representative of the 1 9 chromosomes of 6. napus 
and successfully carried out associative transcriptomics 
of glucosinolate and erucic acid content in a population 
of 53 B. napus lines. 

In this study, we genotyped a panel of 472 rapeseed 
accessions from all over the world using a 60K Brassica 
Infinium® SNP array recently developed by the inter- 
national Brassica lllumina SNP consortium. By in silico 
mapping of the SNPs of the array to 'pseudomolecules' 
representative of the 6. napus genome to obtain their 
hypothetical position, we performed a GWAS of seed 
weight and seed quality in B. napus. The SNPs signifi- 
cantly associated with traits were identified, and the 
candidate genes were confirmed or predicted. 

2. Materials and methods 

2.1. Plant materials 

A set of 472 rapeseed inbred lines collected from the 
National Mid-term Gene Bank for Oil Crops of China 
we re used forassociation analysis. According to the infor- 
mation from the gene bank and our own observations, 
the accessions were assigned to five different germplasm 
types, i.e. winter oilseed rape (OSR) (1 60), semi-winter 
OSR (200), spring OSR (1 1 0), spring fodder (1) and 
winterfodder (1 ). Based on theirorigins, 266 accessions 
originated from Asia, 1 28 from Western Europe, 2 Ofrom 
Oceania, 26 from North America and 32 from Eastern 
Europe (Supplementary Table SI ). For SNP array quality 
control, 1 2 doubled haploid (DH) lines (Supplementary 
Table S2), a resynthesized 6. napus SBn8, a cultivar 
Zhongshuang9 and their F, hybrid named F1ZSSB were 
employed. 

2.2. Experiment design and traits measurement 

One representative plant of each accession of the as- 
sociation population was self-pollinated in the 2010/ 
201 1 winter-spring growing season, and the leaves 
of each plant were sampled to extract genomic DNA. 
The self-pollinated seeds of each accession were 
grown in the experimental farm at Wuhan (1 1 4.3 1 °E, 
30.52°N), China, in the 201 1 /201 2 growing season. 
Each accession was grown in a plot with three rows 
and 10-12 plants in each row, with a distance of 
0.2 m between plants within each row and 0.3 m 
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between rows. Before flowering time, each plot was 
separated by nylon net with a 0.3 5-mm square hole 
to propagate seeds. Fatty acid analysis was performed 
by using gas liquid chromatography (GC) with a 
Model 6890N GC analyzer (Agilent Technologies, Inc., 
Wilmington, DE, USA), following the protocol as 
described.^^ The erucic acid content was expressed as 
its percentage of total fatty acids in mature seeds. 

The 472 accessions of the association population 
were grown using a randomized complete block 
design with three replications in the experimental 
farm at Wuhan, and Nanchang (1 1 6.27°E, 28.37°N), 
China, in the 201 2/201 3 growing season. Each acces- 
sion was grown in a plot with five rows with the same 
plant density abovementioned. The weight of randomly 
selected 1 000 well-filled, open-pollinated seeds of 
each plot was measured to represent seed weight for 
one replicate of each accession. Seed oil content and 
total glucosinolate content were estimated by near-in- 
frared reflectance spectroscopy (NIRS). Approximately 
3 g seeds per accession were measured using a Foss 
NIRS Systems 5000 instrument on a reflectance scan- 
ning mode. 

As the traits of the association panel were investi- 
gated in multi-environments with three replications, 
an R script (www.eXtension.org/pages/61006) based 
on a linear model described by Merk et al. was used 
to obtain the best linear unbiased prediction (BLUP) 
of each trait of each line. The resulting values were 
used as phenotypes for the association analysis. 

2.3. SNP genotyping and filtering 

SNP genotyping was performed using the Brassica 
60K lllumina® Infinium SNP array by Emei Tongde Co. 
(Beijing) according to the manufacture's protocol 
(http://www.illumina.com/technology/infinium_hd_ 
assay.iimn). The SNP data were clustered and called 
automatically using the lllumina BeadStudio genotyp- 
ing software. Those SNPs with AA or BB frequency 
equal to zero, call frequency <0.8 or minor frequency 
< 0.0 5 were excluded. The remaining SNPs were scruti- 
nized visually and those SNPs that did not show three 
clearly defined clusters representing the three possible 
genotypes (AA, AB and BB) were also excluded. 

2.4. \n s\\\co mapping of SNPs 

The source sequences for designing SNP probes of the 
SNP array were used to perform a BLAST^^ search 
against version 4 of the 'pseudomolecules' representa- 
tive of the genome of B. napus (Supplementary Table 
S3). Only the top BLAST hits against the pseudomole- 
cules were considered. BLAST matches to multiple 
loci, with the same top identity, were not considered 
to be mapped. 



2.5. Genetic diversity, population structure and linkage 
disequilibrium analysis 
Polymorphism information content (PIC) of the SNPs 
were estimated using the PowerMarker version 3.51 .^^ 
The differences of PIC between linkage groups were 
assessed usingone-way analysis of variance implemen- 
ted in SPSS 9.0. All the SNPs were used to estimate the 
genetic relatedness between individuals by principal 
component analysis (PCA) using the GCTA tool.^^ A 
total of 3900 SNPs [minor allele frequency (MAF) 
> 0.2] evenly distributed across the whole genome 
were selected to perform the following population 
structure and relative kinship analysis. The software 
package STRUCTURE v2.3.4'^° was used to infer popula- 
tion structure. Five independent runs were performed 
with a /<-value (the putative number of genetic groups) 
varying from 1 to 1 0, with the length of burnin period 
and the number of MCMC (Markov Chain Monte 
Carlo) replications after burnin both to 1 00 000 under 
the 'admixture model'. The most likely /<;-value was deter- 
mined by the log probability of data [LnP(D)] and an ad 
hoc statistic A/<: based on the rate of change of LnP(D) 
between successive k as described by Evanno et al.'^^ 
Theclustermembershipcoefficient matrices of replicate 
runsfrom STRUCTURE were integrated to get a Q matrix 
by the CLUMPP software'*^ and graphically displayed 
using the DISTRUCT software package."^^ Accessions 
with the probability of membership >0. 7 were 
assigned to corresponding clusters, and those <0.7 
were assigned to a mixed group. Nei's genetic distance "^"^ 
was estimated and used for constructing an unrooted 
neighbour-joining tree by the PowerMarker software, 
and the tree was visualized using FigTree (http://tree. 
bio.ed.ac.uk/software/figtree/). The relative kinship 
matrix comparingall pairsofthe472 accessions was cal- 
culated using the software package SPAGeDi."*^ Negative 
values between two individuals were set to 0, as 
described by Yu et al.'^^ LD was estimated as the 
squared allele frequency correlations (r^) between all 
pairs of the SNPs, using the software TASSEL3.0.'^^ 

2.6 Genome-wide association analysis 

Trait-SNP association analysis was performed using 
six models to evaluate the effects of population struc- 
ture (Q, PC) and kinship (K): (i) naive— without control- 
ling for Q and K, (ii) Q model, controlling for Q, (iii) PCA 
model, controlling for PC, the top 1 0 principal compo- 
nents were used as fixed effects, (iv) K model, control- 
ling for K, (v) Q + K model, controlling for both Q and 
K and (vi) P + K model, controlling for both PC and 
K. The naive, Q and PCA models were performed using 
a general linear model; the K, Q + K and P + K models 
were performed using a mixed linear model with 
optimum compression and population parameters pre- 
viously determined (P3D) variance component 
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estimation in TASSEL 3.0.'^*^''^^ The distribution of 
observed -logioCp) for each SNP from marker-trait 
associations was compared with the expected distribu- 
tion in aquantile-quantile plot. Significance of associa- 
tions between SNPs and traits was based on threshold 
p < 2 X 1 (i.e. -logio(p) = 5.7), a stringent 
Bonferroni correction calculated by dividing 0.05 by 
the total number of SNPs in the analysis, 24 256. The 
significant association of more than one SNP was 
regarded as a true association. Furthermore, false dis- 
covery rates (FDRs) were calculated as [(m x P)/n] x 
1 00%,'^'''^° where m is the total number of SNPs (i.e. 
24 2 56), P is the p-value threshold for detecting signifi- 
cant association (i.e. 2 x 10~^) and n is the total 
number of significant associations per trait. 



3. Results 

3.7. Phenotypic variations of measured quantitative 
traits 

Glucosinolate content, oil content and seed weight of 
the B. napus association panel comprising 472 acces- 
sions were measured for three replications in two loca- 
tions. As the key genes responsible for the natural 
variation of the erucic acid content are known, this 
trait was investigated foronly one replication in one lo- 
cation and used for evaluating the ability and accuracy 
of the GWAS of this panel. Extensive phenotypic varia- 
tions were observed as shown in the descriptive statis- 
tics in Table 1 . The erucic acid content, which varied 
from 0 to 53.7 with an average of 22.7, had the 
maximum coefficient of variation of 73.4%, whereas 
oil content in Nanchan, which varied from 33.4 to 
52.6 with an average of 43.6, had the lowest coefficient 
of variation of 5.4%. Seed weight and glucosinolate 
content were quite consistent across different geo- 
graphic location replicates with the correlation coeffi- 
cient of 0.876 and 0.914, respectively, and oil 
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content has the correlation coefficient of ~0.775 
(Supplementary Fig. SI ). 

3.2. SNP performance and quality 

Calling SNP genotype data using the BeadStudio 
genotyping software generally produced three clear 
clusters, i.e. the AA homozygote, BB homozygote and 
AB heterozygote. Of the 52 1 57 SNPs in the array, 1 0 
389 which had zero call frequency of AA or BB were 
excluded. Then, with a cut-off of missing data >0.2 
andMAF<0.05, 1 866 and 1 428 SNPs were filtered, re- 
spectively, reducing the number of SNPs to 38 474. 
Those that remained were scrutinized visually and 29 
02 7 SNPs with three clearly defined clusters were 
selected. Theoretically, the DH lines are homozygous 
throughout their genome, while the 12 DH lines 
derived from geographically different inbred lines of 6. 
napus showed a heterozygosity rate of 1.8~2.8%, 
which might be caused by miss-calling or homologous 
sequences and should be excluded. As a result, 21 86 
SNPs with the heterozygous genotype in any DH lines 
were filtered and 26 841 SNPs were finally selected. 

Tofurthercontrol SNP quality, a parent/Fi tripletand 
four DNA duplicates were used to analyse the pedigree 
consistency and technical reproducibility, respectively. 
As a result, the allele calls were highly reproducible 
(Supplementary Table S4) with no or negligible incon- 
sistencies between technical replicates. Furthermore, 
the parent/Fi triplet showed a pedigree inconsistency 
of <1%, confirming the high quality of the genotype 
calling of the selected SNPs in the array. 

3.3. SNP in silico mapping and diversity 

By BLAST analysis of the resource sequences of the 
SNPs in the array against 'pseudomolecules' representa- 
tive of the genome of 6. napus, 47 805 of 52 1 57 SNPs 
were in silico mapped (Supplementary Table S3). 
Among the selected 26 841 SNPs with fine perform- 
ance, 24 2 56 were mapped (Supplementary Table S5 



Association Study Using SNP Array in Rapeseed 



Table 1 . Phenotypic variations for seed weight and seed quality in this B. napus panel 



Traits 


Min ± SD'' 


Max : 


tSD 


Mean + SD 


CV (%)'' 


Glucosinolate content (jjimol/g)— Wuhan 201 3 


28.5 + 2.0 


142.8 


+ 0.7 


91.5 ± 0.3 


29.4 


Glucosinolate content (jjimol/g)— Nanchan 201 3 


30.1 + 0.9 


1 38.0 


+ 3.6 


90.1 + 0.9 


27.4 


Oil content (%)— Wuhan 2013 


34.2 + 0.8 


51.4 


+ 0.4 


42.6 + 0.3 


5.9 


Oil content (%)— Nanchan 201 3 


33.4 + 1.9 


52.6 


+ 1.5 


43.6 ± 0.7 


5.4 


Erucic acid content (%)— Wuhan 201 2 


0 


53.7 




22.7 


73.4 


Seed weight (g)— Wuhan 201 3 


2.3 ± 0.1 


5.9 


+ 0.2 


3.5 ± 0.0 


14.2 


Seed weight(g)— Nanchan 2013 


2.3 ± 0.1 


5.2 


+ 0.2 


3.6 ± 0.1 


13.8 



^SD is an abbreviation of standard deviation, which was calculated based on the measured values of seeds from three replicated 
experimental blocks. 

''CV is an abbreviation of coefficient of variation, which was estimated as the ratio of the standard deviation to the mean of all 
accessions. 



No. 4] 



F. Li et al. 



359 



and Fig. S2). Linkage group A2 had the least markers of 
506 with a marker density of one per 52.9 kb, and C4 
had the most markers of 21 04 with a marker density 
of one per 2 5.0 kb. The SNP diversity in the whole col- 
lection was expressed by a PIC value, and the results 
were shown in Supplementary Table S5 and summar- 
ized in Table 2. Except for linkage groups C5, C7 and 
C8, 7 5% of SNPs in each linkage group had PIC values 
over 0.2 5. The levels of polymorphism of linkage 
groups in the A subgenome (with the exception of A2, 
A4 and A5) were generally higher than those in the C 
subgenome. The mean PIC values of A8 and A9 were 
highest, nearly 0.32, whereas that of C7 was lowest, 
only 0.254. 

3.4. Population structure, relative kinship and linkage 
disequilibrium 
The population structure of the association panel was 
calculated using 3900 SNPs by STRUCTURE, and clus- 
tering inference performed with possible clusters {l<) 
from 1 to 1 0 showed that the most significant change 
of likelihood occurred when /< increased from 3 to 4, 
and the highest A/^ value was observed at /<; = 3 
(Fig. 1 ). Both parameters suggested thatthe 472 geno- 
types could be assigned into three groups. Using a 



probability of membership threshold of 70%, 1 64, 2 9 
and 1 7 lines were assigned into the three groups, re- 
spectively. The remaining 2 62 lines were classified 
into a mixed group (Supplementary Table SI ). Most of 
the lines in Group 1 were from Asia and belong to 
semi-winter OSR (Supplementary Table S6). European 
lines were the main component of the Groups 2 and 
3, in which the larger part were winter OSR and spring 
OSR, respectively. The NJ phylogenetic tree based on 
Nei's genetic distances displayed three clear clades 
(Supplementary Fig. S3), corresponding to the three 
groups estimated by STRUCTURE. The lines belong to 
mixed groups distributed across the whole tree. The 
PCA based on the 24 2 56 genome-wide SNPs showed 
that the first two principal components explained 
1 1 .7 and 5.8% of the genetic variance, respectively 
(Supplementary Fig. S4). The three major groups iden- 
tified using STRUCTURE (i.e. except for the mixed 
group), winter OSR, spring OSR and semi-winter OSR 
were assigned to three major clusters. The analysis of 
relative kinship showed that the average relative 
kinship between any two lines was 0.0856. A total of 
51.4% of kinship coefficients between lines were 
equal to 0, and 32.6% kinship coefficients ranged 
from 0 to 0.2 (Supplementary Fig. S5). This pattern of 
genetic relatedness revealed that most lines have no 



Table 2. Summary of the PIC values in different linkage groups of the 6. napus 



Linkgae 


Number 


PIC 


value 
























group 


of SNPs 


0.05-0.1 


0.1 


-0.1 5 


0.1 E 


-0.2 


0.2 


-0.25 


0.25-0.3 


0.3-0.35 


0.35-0.4 


Average 

pic'' 


Al 


1 072 


9 


(0.8%) 


34 


(3.2%) 


83 


[7.7%) 


1 1 7 


(1 0.9%) 


1 02 


(9.5%) 


248(23.1%) 


479 (44.7%) 


0.31 g 


A2 


506 


2 


(0.4%) 


34 


(6.7%) 


25 


(4.9%) 


46 


(9.1%) 


1 1 5 


(22.7%) 


1 30 (25.7%) 


1 54 (30.4%) 


0.294 cd 


A3 


1469 


6 


(0.4%) 


45 


(3.1%) 


93 


(6.3%) 


1 30 


(8.8%) 


210 


(14.3%) 


383 (26.1%) 


602 (41%) 


0.311 g 


A4 


1035 


6 


(0.6%) 


61 


(5.9%) 


56 


;5.4%) 


131 


(1 2.7%) 


209 


(20.2%) 


272 (26.3%) 


300 (29%) 


0.293 bed 


A5 


1 1 23 


1 


(0.1%) 


35 


(3.1%) 


69 


;6.i%) 


1 1 5 


(1 0.2%) 


1 79 


(1 5.9%) 


376 (33.5%) 


348 (31%) 


0.307 fg 


A6 


1099 


3 


(0.3%) 


36 


(3.3%) 


58 


(5.3%) 


91 


(8.3%) 


1 68 


(1 5.3%) 


317 (28.8%) 


426 (38.8%) 


0.31 2 g 


A7 


1427 


1 1 


(0.8%) 


50 


(3.5%) 


70 


(4.9%) 


1 26 


(8.8%) 


217 


(1 5.2%) 


382 (26.8%) 


571 (40%) 


0.31 g 


A8 


691 


7 


(1%) 


35 


(5.1%) 


38 


(5.5%) 


40 


(5.8%) 


53 


(7.7%) 


1 43 (20.7%) 


375 (54.3%) 


0.31 9 h 


A9 


1 225 


9 


(0.7%) 


45 


(3.7%) 


39 


;3.2%) 


68 


(5.6%) 


1 90 


(1 5.5%) 


261 (21.3%) 


61 3 (50%) 


0.32 h 


Al 0 


805 


8 


(1%) 


33 


(4.1%) 


38 


(4.7%) 


72 


(8.9%) 


1 04 


(1 2.9%) 


1 93 (24%) 


357 (44.3%) 


0.312g 


CI 


201 2 


1 


(0%) 


73 


(3.6%) 


1 24 


(6.2%) 


238 


(1 1.8%) 


298 


(14.8%) 


1 074 (53.4%) 


204 (10.1%) 


0.298 cde 


C2 


1 292 


2 


(0.2%) 


23 


(1.8%) 


54 


;4.2%) 


1 1 6 


(9%) 


397 


(30.7%) 


321 (24.8%) 


379 (29.3%) 


0.3 de 


C3 


2201 


29 


(1.3%) 


77 


(3.5%) 


1 59 


;7.2%) 


307 


(1 3.9%) 


352 


(1 6%) 


516 (23.4%) 


761 (34.6%) 


0.298 cde 


C4 


21 04 


32 


(1.5%) 


94 


(4.5%) 


1 63 


(7.7%) 


1 81 


(8.6%) 


351 


(1 6.7%) 


560 (26.6%) 


723 (34.4%) 


0.301 ef 


C5 


719 


2 


(0.3%) 


32 


(4.5%) 


59 


(8.2%) 


1 05 


(14.6%) 


1 10 


(1 5.3%) 


1 21 (1 6.8%) 


290 (40.3%) 


0.298 cde 


C6 


1 539 


28 


(1.8%) 


52 


(3.4%) 


101 


(6.6%) 


1 36 


(8.8%) 


403 


(26.2%) 


41 7 (27.1%) 


402 (26.1%) 


0.296 cde 


C7 


2006 


44 


(2.2%) 


591 


(29.5%) 


52 


(2.6%) 


1 85 


(9.2%) 


1 56 


(7.8%) 


587 (29.3%) 


391 (19.5%) 


0.254 a 


C8 


1096 


5 


(0.5%) 


23 


(2.1%) 


40 


(3.6%) 


362 


(33%) 


90 


(8.2%) 


229 (20.9%) 


347 (31.7%) 


0.292 be 


C9 


835 


14 


(1.7%) 


1 8 


(2.2%) 


78 


(9.3%) 


97 


(1 1.6%) 


227 


(27.2%) 


1 92 (23%) 


209 (25%) 


0.287 b 



^PIC is an abbreviation of polymorphism information content. 

''Values followed by different letters in this column are significantly different at the level of 0.05. 
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Figure 1. Analysis ofthe population structure of 472 rapeseed accessions by STRUCTURE, (a) Estimated LnP(D) of possible clusters (/<) from 1 to 
1 0;(b) A/< based on the rate of change of LnP(D) between successive /<; (c) population structure based on/< — 3. Each individual is represented 
by a vertical bar, partitioned into coloured segments with the length of each segment representing the proportion of the individual's 
genome. A given group is represented: Red, Group 1 ; Green, Group 2; Purple, Group 3. 



or weak kinship in this OSR panel. To assess the extent of 
LD, we calculated the mean pairwise for all SNPs as 
0.01 76. 



3.5. Association mapping 

We assessed the utility of sixstatistical models forcon- 
trollingtype I and II errors in this 6. napus panel. As can 
be seen in the QQ plots (Fig. 2), the distribution of 
observed -logio(p) values from the naive model 
departed quite far from the expected distribution 
leading to a high level of false-positive signals. For all 
four traits, any model controlling population structure 
or relative kinship performed significantly better than 
the naive model. The models controlling relative 
kinship, i.e. K,Q + Kand P + K, had similareffects in re- 
ducing the false positives, while P + K showed slightly 
greater effect than K and Q + K models. For erucic 
acid content and glucosinolate content, the PCA, K, 
Q+K and P + K models were better than the Q 
model. All of the models except the naive model per- 
formed well foroil content and seed weight, and no sig- 
nificant difference was observed among these five 
models for the seed weight. There were no significant 
associations between SNPs and oil content in the Q, 
P + K models based on threshold -logio(p) = 5.7, 
which might indicate false negatives in this analysis. 

AccordingtotheQ-Q plots ofthe six models, we used 
PCA and Q+K models to identify association signals. 



First, in order to evaluate the power of association ana- 
lysis in this panel, the erucic acid content, for which 
genetic control was well known, was first used to 
perform association analysis. Two significant regions 
located at 9.5 and 63.7 Mb of A8 and C3, respectively, 
in the 'pseudomolecules' of 6. napus were detected by 
both models (Fig. 3a, Supplementary Fig. S6a and 
Table 3). The FDRs for associations detected in Q + K 
and PCA models were 0.046 and 0.31 2%, respectively. 
The two GWAS peaks, at Bn-A08-pl 2 599446 and Bn- 
scaff_l 5794_3-p29807, contributed to 12.2 and 
7.9% of phenotypic variance, respectively, based on 
values. These two peak SNPs were found to be 233 and 
128kb away from the key genes BnaA.FAEI and 
BnaC.FAEl , respectively, indicating that our association 
genetics approach was successful. 

Four and five significant associations for glucosino- 
late content were detected by Q + K and PCA models, 
respectively, with the FDR of 0.083 and 0.1 06%. Both 
models detected four common regions at 3.2, 50.0, 
39.9 and 2.8 Mb of A9,C2,C7 and C9, respectively, in 
the 'pseudomolecules' of 6. napus (Fig. 3b,Supplemen- 
tary Fig. S6b and Table 3). The cumulative phenotypic 
variance explained by all significant SNPs was 56.7%. 
As the deletions of the A. thaiiana orthologous gene 
HAG) (At5g61420) in C2 and A9 were regarded to 
lead to low glucosinolate content,^"^ we searched the 
HAC1 paralogues in the 'pseudomolecules' of B. napus 
and found four paralogues at 3.4 Mb of A9, 49.6 Mb 



No. 4] 



F. Li et al. 



361 



(a) 70 Erucic acid content (b) 70 Glucosinolate content 




-log,o(P.,p.c.«.) -log,o(P.xp.c«d) 

Figure 2. Quantile-quantile plots of estimated -lDgio(p) from association analysis using six methodsforfour traits: (a) Erucic acid content; (b) 
glucosinolate content; (c) oil content and (d) seed weight. The black line is the expected line under the null distribution, and the deviations 
from expectation indicate that the statistical analysis may cause spurious associations. The horizontal dashed green lines indicates genome- 
wide significance threshold -logio(p)= 5.7. 



ofC2,40.7 MbofCZ and 4.7 MbofC9, which were very 
similartothefourGWAS peaks. The candidate geneBno- 
C.HAGIc was 1.9 Mb away from the peak SNP Bn- 
scaff_l 7526_1 -pi 140588, which might be due to 
the low density of SNPs around the candidate gene. In 
addition, the PCA model also detected a novel locus at 
6.33 Mb of C4, which contributed 5.0% of the pheno- 
typic variation (Supplementary Fig. S6b and Table 3). 
For oil content, only one significantly associated 
region in A8 was detected by Q + K and PCA models 
with FDRsof 2.5 and 0.71 4%, respectively (Fig. 3c, Sup- 
plementary Fig. S6c and Table S3). The GWAS peaks 
explained 6.1% of the total phenotypic variance. The 
GWAS peak for oil content was similar to that for 
erucic acid. For seed weight, Q + K and PCA models 
detected one association in A9 and A7 with the FDR of 
1.667 and 0.384%, respectively. The GWAS peak 
detected by Q + K, i.e. Bn-A09-p30654305, at 
34.65 Mb of A9, explained 1 1.4% of the phenotypic 
variance (Fig. 3d, Supplementary Fig. S6d and 
Table 3). The SNP Bn-A09-p30654305 was also 
detected by the PCA model as a single associated SNP 
in A9. In addition, using the PCA model, we also 



detected an associated SNP Bn-Al 0-pl 2639538 at 
2.67 Mb of A7 (Supplementary Fig. S5d and Table 3), 
which explained 4.9% ofthe total seed weight variation. 



4. Discussion 

High-quality molecular markers and reliable geno- 
type data are the precondition of genetic diversity ana- 
lysis and association mapping. The amphidiploid nature 
ofthe B. napus genome might make automated SNP de- 
tection challenging and confounding, with homologue, 
paralogue and also homoeologue variation. Trick 
etfl/.^°termed three types of SNPs, i.e. inter-homoeolo- 
gue polymorphisms, hemi-SNPs and simple SNPs in B. 
napus. The inter-homoeologue polymorphisms do not 
represent allelic variation per se, and as hemi-SNPs are 
the allelic polymorphisms in the homoeologous 
sequences, the short flanking sequences of the hemi- 
SNPs could not be specifically mapped in the 'pseudo- 
molecules' representative of the genome of B. napus. 
By scanning the rapeseed association panel using the 
60K SNP array, the clustering results showed that ~1 0 
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Figure 3. Manhattan plots of association analysis using the Q + K mode! for four traits: (a) erucic acid content; (b) glucosinolate content; (c) oil 
content and (d) seed weight. Each dot represents a SNP. The horizontal dashed green line represents the Bonferroni-corrected significance 
threshold -logTo(p) = 5.7. 



000 SNPs had zero call frequency of AA or BB, which 
could, in part, be caused by inter-homoeologue poly- 
morphisms. Meanwhile, ~1 0 000 SNPs showed more 
than three clusters, which are indicative of two loci in 
a duplicated sequence as suggested by Canal et al.^^ 
and are possibly hemi-SNPs. Bystrictlyfiltering,the gen- 
otypes of the finally selected SNPs of all DH lines were 
completely homozygous, and the DNA pedigree con- 
sistency and technical reproducibility were nearly 
perfect, suggesting the high quality of these selected 
SNPs. 

On the whole, over 76% of the SNPs have the PIC of 
0.2 5-0.5, which was regarded as the PIC range of 
medium polymorphism of the DNA marker.^^ On 
average, the polymorphism level was slightly higher 
for the A than for the C linkage groups in our population, 
contrary to the results of Delourmeet a/. In our panel, 
nearly half of the genotypes were Asiatic, while those of 
Delourme et al. were European.^^ Asiatic cultivars were 
selected for improved adaptation to the local environ- 
ment by introgression of 6. rapa into 6. napus germ- 
plasm, ^"^'^^ increasing the genetic diversity of the A 
genome. Our finding suggests that genetic diversity of 



6. napus germplasm should be broadened by introgres- 
sion of 6. o/eracefl, as described by previous studies.^"^'^^ 
Understanding the population structure of the asso- 
ciation mapping population contributes to reducing 
both Type I and II errors between molecular markers 
and traits of interest.'^^'^^ In our study, the association 
population was classified into three groups, semi- 
winter, winter and spring OSR were the main compo- 
nents of these three groups. Distinct clustering of 
winter and spring types has been reported by previous 
studies.^"^'^^'^^'^^ In our study, although Croups 2 and 
3 mainlydistinguishedwinterandspringtypes, respect- 
ively, some winter and spring types were found in the 
mixed group, located at an intermediate position 
between Croups 2 and 3. Delourme et al.^^ also 
observed that some spring OSR lines did not show dis- 
tinct clustering. These intermediate lines might be 
derived from hybrids of the lines from different gene 
pools.^°'^^ The kinship analysis revealed that most 
lines in the panel have no or weak kinship, together 
with significantdifferences in phenotype performance, 
indicatingthatthis panel is suitable for association ana- 
lysis. The mean pairwise values is close to previous 
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estimates of 0.01 1 7^*^ or 0.0247^"^ and lower than 
0.03 7 estimated for the population used by Delourme 
etfl/.,^^confirmingthe low overall level of LD in 6. napus. 
At the genome level, mean LD decay was estimated to 
be 0.5-1.2 cM in rapeseed germplasm.^'*'^^'^^ Given 
the length of the genetic map of 2 500 cM, >21 00- 
5000 evenly spaced markers would be necessary to 
perform GWASs in rapeseed. Therefore, by selecting in 
excess of 20 000 SNPs, our study should have more 
than sufficient markers to perform a good association 
analysis. 

In the previous association analysisstudy in rapeseed, 
generally only population structure estimated by 
STRUCTURE'^°wasconsidered.^^"^'''*^^However,spuri- 
ous associations cannot be controlled completely by 
population structure since the Q matrix only gives a 
rough dissection of population differentiation. Further- 
more, the programme STRUCTURE needs intensive 
computational cost on large data sets. The current 
PCA software such as EIGENSTRAT" and GCTA^'' 
could analyse tens of thousands of samples with mil- 
lions of SNPs with high performance and infercontinu- 
ous axes of genetic variation; therefore, the PCA based 
on genome-wide SNPs was broadly used to detect and 
correct for population stratification in GWAS;^^'^^ 
however, a few residual false-positive associations still 
existed in the PCA model. ^"^Yu etal. suggested that in- 
tegration of the pairwise kinship into a mixed model to 
correct for relatedness could reduce spurious associ- 
ation, which was subsequently supported by the study 
in maize,"^^ sorghum^^ and barley.^^ In our study, six 
models, i.e. naive, Q, PCA, K, Q + K and P + K models, 
were first compared for controlling the false positives 
in B. napus association mapping. The models consider- 
ing population structure or relatedness, especially 
PCA, K, Q+ Kand P + K models, could effectively elim- 
inate the excess of low p-values for all traits; however, 
the P + K model also likely eliminated true positives, 
which is a common problem seen in other systems as 
well.^^'^^ Therefore, in order to reduce the false posi- 
tives and false negatives, both Q + K and PCA models 
were considered to identify association signals. 

The position of the DNA markers relative to the 
genome sequence or linkage map is also preferable 
for successful association mapping. In this study, the 
SN Psofthe array were ms;7/co mapped in B. napus'pseu- 
domolecules' to establish their hypothetical order. The 
positions of the significant association signals detected 
in this study are consistent with previous studies. It 
has been shown that two homoeologues, namely 
BnaA.FAEI and BnaC.FAEI on linkage groups A8 and 
C3, respectively, are responsible for controlling erucic 
acid content.^^'^^ In the present study, the peaks of 
the two association signals for erucic acid content 
(represented by the markers Bn-A08-pl 2599446 
and Bn-scaff_l 5794_3-p29807, which have the 
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highest significance of association on A8 and C3, re- 
spectively) were within 233 kb of these two genes, indi- 
cating that our association approach was successful. 

For glucosinolate content, we identified four associ- 
ation signals on A9,C2,C7 and C9, which were detected 
independently in differentstudies.^'^°'^^ Howell etfl/. ^° 
regarded thattheQTLson A9,C2 and C9 were homoeo- 
logous loci, whereas the underlying control genes were 
not uncovered until Harper et al.^'^ identified genomic 
deletions that underlie the QTLs on A9 and C2. In the 
low-glucosinolate rapeseed accessions, the deleted seg- 
ments contained orthologs of the transcription factor 
HAG1 (At5g61 420),^"^ which controls aliphatic gluco- 
sinolate biosynthesis in A. thaliana/^ In Brassica 
7t/ncefl,silencingof the orthologs of /-//4C7 was reported 
to cause low glucosinolate content.^^ Interestingly, in 
the present study, four pea kSNPs associated with gluco- 
sinolate content were identified to be close to the para- 
log ues of /-//4G 7. Further studies are necessary to reveal 
the DNA variation of the other two paralogues of HAG 1 
on C7 and C9. 

Control of seed oil content is complex and previous 
studies have detected at least 1 0 QTLs, which were en- 
vironmentally sensitive and with minor effects.^'^"^^ 
These studies commonly regarded that two QTLs on 
A8 and C3 overlapped with the QTLs for erucic acid 
content. Since the BnaA.FAEI and BnaC.FAEI were 
cloned and confirmed as the key factors controlling 
the synthesis of erucic acid in rapeseed,^^'^^ it was 
regarded that the pleiotropy of FAEl was responsible 
for the decrease in seed oil content along with the re- 
duction of seed erucic acid content in the modern cul- 
tivars. Our study showed that the peak SNP on A8 was 
only 233 kb away from the BnaA.FAEI, indicating the 
role of fine mapping of GWAS. 

Seed weight of rapeseed is also a complex trait, which 
has a relatively high heritability and may primarily be 
controlled by genes with additive effects.^ ^ So far, 
> 80 QTLs across the 1 9 linkage groups have been iden- 
tified and individual QTL contributed to 2.0-2 8.2% of 
the phenotypic variation.^ The seed weight 
QTLs on A7 were repeatedly detected in many studies 
with diverse genetic materials,^ o-i 2,78,80 ^[^^ qj|_ 
on A9 was identified as a major QTL with the most con- 
tribution of 2 8.2% for the total seed weight variation.^ ^ 
Interestingly, our association mapping also detected a 
major QTL on A9 and a minor QTL on A7, indicating 
our association panel might be suitable for dissecting 
complex traits in 6. napus. Since associated markers 
and causal polymorphisms underlying traits of interest 
are in a narrow interval equivalent to the distance of LD 
decay,^^ the genes flanking the peak SNPs and falling 
within the window of LD decay were always used to 
predict the candidate genes.^^ As the average LD 
across our panel is very low (mean pairwise = 
0.01 76), it is reasonable to suppose that the detected 



loci in our study will be located in close proximity to 
the candidate genes controlling seed weight, and that 
these tightly associated SNPs would be of significant 
benefit in a molecular design breeding approach to 
improve seed weight. 

Supplementary Data: Supplementary data are 
available at www.dnaresearch.oxfordjournals.org. 
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